========================================================
## user_id product_id gender age occupation city_category
## 1 1000001 P00069042 F 0-17 10 A
## 2 1000001 P00248942 F 0-17 10 A
## 3 1000001 P00087842 F 0-17 10 A
## 4 1000001 P00085442 F 0-17 10 A
## 5 1000002 P00285442 M 55+ 16 C
## 6 1000003 P00193542 M 26-35 15 A
## stay_in_current_city_years marital_status product_category_1
## 1 2 0 3
## 2 2 0 1
## 3 2 0 12
## 4 2 0 12
## 5 4+ 0 8
## 6 3 0 1
## product_category_2 product_category_3 purchase
## 1 NA NA 8370
## 2 6 14 15200
## 3 NA NA 1422
## 4 14 NA 1057
## 5 NA NA 7969
## 6 2 NA 15227
The dataset contains 537,577 observations of transactions made on Black Friday in a retail store.
## [1] 537577 12
The dataset contains 537,577 observations and 12 variables.
## 'data.frame': 537577 obs. of 12 variables:
## $ user_id : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ product_id : Factor w/ 3623 levels "P00000142","P00000242",..: 671 2375 851 827 2733 1830 1744 3319 3597 2630 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
## $ age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
## $ occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ city_category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
## $ stay_in_current_city_years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
## $ marital_status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ product_category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ product_category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ product_category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
## user_id product_id gender age
## Min. :1000001 P00265242: 1858 F:132197 0-17 : 14707
## 1st Qu.:1001495 P00110742: 1591 M:405380 18-25: 97634
## Median :1003031 P00025442: 1586 26-35:214690
## Mean :1002992 P00112142: 1539 36-45:107499
## 3rd Qu.:1004417 P00057642: 1430 46-50: 44526
## Max. :1006040 P00184942: 1424 51-55: 37618
## (Other) :528149 55+ : 20903
## occupation city_category stay_in_current_city_years
## Min. : 0.000 A:144638 0 : 72725
## 1st Qu.: 2.000 B:226493 1 :189192
## Median : 7.000 C:166446 2 : 99459
## Mean : 8.083 3 : 93312
## 3rd Qu.:14.000 4+: 82889
## Max. :20.000
##
## marital_status product_category_1 product_category_2 product_category_3
## Min. :0.0000 Min. : 1.000 Min. : 2.00 Min. : 3.0
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 9.0
## Median :0.0000 Median : 5.000 Median : 9.00 Median :14.0
## Mean :0.4088 Mean : 5.296 Mean : 9.84 Mean :12.7
## 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.:15.00 3rd Qu.:16.0
## Max. :1.0000 Max. :18.000 Max. :18.00 Max. :18.0
## NA's :166986 NA's :373299
## purchase
## Min. : 185
## 1st Qu.: 5866
## Median : 8062
## Mean : 9334
## 3rd Qu.:12073
## Max. :23961
##
## [1] 5891
Let’s use some plots to represent the data.
This table represents the main dataset grouped by user_id.
## # A tibble: 6 x 8
## # Groups: user_id, gender, age, occupation, city_category,
## # stay_in_current_city_years [6]
## user_id gender age occupation city_category
## <int> <fctr> <fctr> <int> <fctr>
## 1 1000001 F 0-17 10 A
## 2 1000002 M 55+ 16 C
## 3 1000003 M 26-35 15 A
## 4 1000004 M 46-50 7 B
## 5 1000005 M 26-35 20 A
## 6 1000006 F 51-55 9 A
## # ... with 3 more variables: stay_in_current_city_years <fctr>,
## # marital_status <int>, total_purchases <int>
## user_id gender age occupation city_category
## Min. :1000001 F:1666 0-17 : 218 Min. : 0.000 A:1045
## 1st Qu.:1001518 M:4225 18-25:1069 1st Qu.: 3.000 B:1707
## Median :1003026 26-35:2053 Median : 7.000 C:3139
## Mean :1003025 36-45:1167 Mean : 8.153
## 3rd Qu.:1004532 46-50: 531 3rd Qu.:14.000
## Max. :1006040 51-55: 481 Max. :20.000
## 55+ : 372
## stay_in_current_city_years marital_status total_purchases
## 0 : 772 Min. :0.00 Min. : 44108
## 1 :2086 1st Qu.:0.00 1st Qu.: 234914
## 2 :1145 Median :0.00 Median : 512612
## 3 : 979 Mean :0.42 Mean : 851752
## 4+: 909 3rd Qu.:1.00 3rd Qu.: 1099005
## Max. :1.00 Max. :10536783
##
The purchase plot shows a roughly symmetrical distribution with a peak around $7,000. I notice some gaps at different purchase levels.
We have more male than female customers. City category C has the most customers, and the 26-35 age bracket is the largest among the customers.
The number of unmarried customers is higher than the number of married ones.
Most users made purchases totaling less than $5 million, but we notice some outliers who spent over $7.5 million and even over $10 million.
## # A tibble: 2 x 2
## gender total_purchases
## <fctr> <dbl>
## 1 F 1164624021
## 2 M 3853044357
Some products are more popular than others
The dataset has 537,577 observations and 12 variables. Each user (a customer of the store) is described by user_id, age, gender, occupation, city_category, stay_in_current_city_years, and marital_status. For each transaction we have the product_id, the product categories (product_category_1, product_category_2, product_category_3), and the purchase amount.
We notice that most users are from city category C, but most purchases are from city category B.
Most users made purchases totaling less than $5 million, but some users (outliers) spent over $7.5 million and even over $10 million.
The number of male shoppers is higher than the number of female shoppers, and the number of unmarried shoppers is higher than the number of married shoppers.
The dataset can help us predict which user profiles will spend more, and on which category of article. The combination of age, gender, and occupation can help with that.
Marital status, city, time spent in the city, and occupation are other features that should be explored to support the main interest of the dataset exploration.
No new variables were created.
I grouped the data by user_id in order to study the users and their features, and to get the proportion of users and the total purchases per user.
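The grouping step described above can be sketched in base R as follows. This is a minimal illustration on a hypothetical six-row `transactions` data frame that reuses the dataset's column names. The original analysis may well have used a different grouping function.

```r
# Toy transaction data (hypothetical rows, same column names as the dataset)
transactions <- data.frame(
  user_id  = c(1000001, 1000001, 1000002, 1000003, 1000003, 1000003),
  gender   = c("F", "F", "M", "M", "M", "M"),
  purchase = c(8370, 15200, 7969, 15227, 1422, 1057)
)

# One row per user: sum the purchase amounts within each user_id/gender pair
by_user <- aggregate(purchase ~ user_id + gender, data = transactions, FUN = sum)
names(by_user)[names(by_user) == "purchase"] <- "total_purchases"

# Proportion of users (not transactions) by gender
gender_share <- prop.table(table(by_user$gender))

print(by_user)
print(gender_share)
```

Counting on the per-user table rather than on the raw transactions matters: a user with many transactions would otherwise be counted many times.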
The 26-35 age bracket has the highest amount of purchases. The 0-17 and 55+ brackets have the smallest. We can also notice some outliers, but in general the median amount of purchases is about the same across all age brackets.
The difference in total purchases between male and female customers is not very large, but there are many more extremely high purchase totals among male customers.
City C has the lowest level of purchases; its median is the lowest among the three cities. In city A and city B about 75% of the purchases were about the same amount, but city A has many more outliers with very high total purchases. The second plot shows that city B has the highest total amount of purchases and city A has the lowest.
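A city can have the most customers without having the highest total spend, because the two are different aggregations. The sketch below shows this on a hypothetical per-user table; the values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical per-user totals (illustrative values only)
by_user <- data.frame(
  user_id         = 1:6,
  city_category   = c("A", "B", "B", "C", "C", "C"),
  total_purchases = c(900, 1500, 1300, 400, 300, 500)
)

# Number of users per city
users_per_city <- table(by_user$city_category)

# Total and median spend per city
total_per_city  <- tapply(by_user$total_purchases, by_user$city_category, sum)
median_per_city <- tapply(by_user$total_purchases, by_user$city_category, median)

print(users_per_city)
print(total_per_city)
print(median_per_city)
```

In this toy table city C has the most users but the lowest median, while city B has the highest total.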
Occupations up to category 7 made the most purchases.
Unmarried customers made many more purchases than married customers.
There are more single men than single women, and also more married men than married women. This might explain why male customers made more purchases than female customers. Singles can usually afford to spend more than married customers, but this is worth more exploration.
The 26-35 and 36-45 brackets are the busiest; they hold occupations from all categories.
Let’s explore the relationship between age and city_category.
##
## A B C
## 0-17 25 50 143
## 18-25 214 331 524
## 26-35 461 652 940
## 36-45 176 335 656
## 46-50 53 146 332
## 51-55 67 135 279
## 55+ 49 58 265
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 164.34, df = 12, p-value < 2.2e-16
The p-value of Pearson’s chi-squared test is very small (< 0.05), so we reject independence: there is a statistically significant association between age and city_category.
City A has a population mainly between 18 and 45 years old.
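The test above can be rerun directly from the printed contingency table. The sketch below rebuilds the age by city_category counts by hand (copied from the table shown earlier) and passes them to `chisq.test`. It should give the same degrees of freedom, and a statistic close to the one reported, if the counts were transcribed correctly.

```r
# Age bracket by city category counts, copied from the printed table
tbl <- matrix(
  c( 25,  50, 143,
    214, 331, 524,
    461, 652, 940,
    176, 335, 656,
     53, 146, 332,
     67, 135, 279,
     49,  58, 265),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    age = c("0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"),
    city_category = c("A", "B", "C")
  )
)

# Pearson's chi-squared test of independence between age and city
test <- chisq.test(tbl)
print(test)

# Age mix within each city (each column sums to 1)
age_share <- prop.table(tbl, margin = 2)
print(round(age_share, 3))

# Share of city A users aged 18 to 45
share_18_45_A <- sum(age_share[c("18-25", "26-35", "36-45"), "A"])
print(share_18_45_A)
```

The column proportions put a number on the remark about city A: the three brackets from 18 to 45 together make up roughly four fifths of its users.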
Let’s explore the relationship between age and stay_in_current_city_years.
##
## 0 1 2 3 4+
## 0-17 31 81 47 33 26
## 18-25 145 357 215 175 177
## 26-35 266 729 384 346 328
## 36-45 150 397 237 216 167
## 46-50 75 192 102 83 79
## 51-55 61 204 78 60 78
## 55+ 44 126 82 66 54
##
## Pearson's Chi-squared test
##
## data: tbl1
## X-squared = 29.058, df = 24, p-value = 0.2179
Contrary to my expectation, the test does not show a significant relationship (p = 0.22) between age and the number of years the person has stayed in the same city.
Now, let’s explore the products bought.
## # A tibble: 6 x 10
## # Groups: product_id, product_category_1, product_category_2,
## # product_category_3, age, gender, city_category [6]
## product_id product_category_1 product_category_2 product_category_3
## <fctr> <int> <int> <int>
## 1 P00000142 3 4 5
## 2 P00000142 3 4 5
## 3 P00000142 3 4 5
## 4 P00000142 3 4 5
## 5 P00000142 3 4 5
## 6 P00000142 3 4 5
## # ... with 6 more variables: age <fctr>, gender <fctr>,
## # city_category <fctr>, marital_status <int>, total_purchases <int>,
## # nbre_purchases <int>
## product_id product_category_1 product_category_2
## P00265242: 78 Min. : 1.000 Min. : 2.00
## P00059442: 77 1st Qu.: 3.000 1st Qu.: 6.00
## P00085942: 77 Median : 5.000 Median :11.00
## P00086042: 77 Mean : 6.003 Mean :10.22
## P00251242: 77 3rd Qu.: 8.000 3rd Qu.:15.00
## P00028842: 76 Max. :18.000 Max. :18.00
## (Other) :120526 NA's :48524
## product_category_3 age gender city_category marital_status
## Min. : 3.0 0-17 : 6542 F:48274 A:36780 Min. :0.0000
## 1st Qu.: 9.0 18-25:20575 M:72714 B:45135 1st Qu.:0.0000
## Median :14.0 26-35:28236 C:39073 Median :0.0000
## Mean :12.7 36-45:23956 Mean :0.4789
## 3rd Qu.:16.0 46-50:16676 3rd Qu.:1.0000
## Max. :18.0 51-55:14583 Max. :1.0000
## NA's :95221 55+ :10420
## total_purchases nbre_purchases
## Min. : 186 Min. : 1.000
## 1st Qu.: 7864 1st Qu.: 1.000
## Median : 15920 Median : 2.000
## Mean : 41472 Mean : 4.443
## 3rd Qu.: 38769 3rd Qu.: 5.000
## Max. :2166488 Max. :124.000
##
Most products were bought fewer than 192 times, and very few were bought more than 1,500 times.
Most of the purchases per product are under $2.5 million.
The 26-35 age bracket has the largest number of purchases.
## [1] 0.90676
There is a strong positive correlation (r ≈ 0.91) between the number of purchases and the amount spent.
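A correlation like the one reported above is Pearson's r between the two summary columns. The sketch below computes it on a hypothetical six-row stand-in for the per-product table. The column names follow the output above, but the values are invented.

```r
# Hypothetical stand-in for the per-product summary table
by_productid <- data.frame(
  nbre_purchases  = c(1, 2, 2, 5, 8, 12),
  total_purchases = c(9000, 21000, 15000, 52000, 95000, 130000)
)

# Pearson correlation between number of purchases and amount spent
r <- cor(by_productid$nbre_purchases, by_productid$total_purchases)
print(r)

# cor.test adds a confidence interval and a p-value for the same statistic
ct <- cor.test(by_productid$nbre_purchases, by_productid$total_purchases)
print(ct$p.value)
```

A high r only says the relationship is close to linear. It does not say whether the spend comes from many cheap items or a few expensive ones.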
I noticed that there are more buyers from city C, but the amount of purchases is higher in city B; this is worth exploring more deeply. I also noticed that occupations up to category 7 made the most purchases, and I will explore their relationship with the other variables.
I was expecting older people to have stayed longer in the same city, but the data doesn’t show that. We also have more young customers in every city.
In each city, the 26-35 age bracket has more customers than any other bracket. It is also the bracket with the highest purchases, so it is worth exploring whether any other variable is correlated with it.
Customers in occupation 8 and between 46 and 50 years old are the ones who made the most purchases.
Male customers in general made more purchases than female customers, particularly those in the 26-35 age bracket.
The 26-35 bracket has the most purchases across the three cities, particularly in city B and then city A.
3 main occupations (0, 4) with very high purchases
Now, let’s see if single men made more purchases.
The data shows that single men made more purchases.
The last two plots show that men between 26 and 35 years old, across all three cities and occupations, are the customers with the most purchases.
This plot shows that the 26-35 age bracket made the largest number of purchases, with high amounts spent.
##
## Calls:
## m1: lm(formula = total_purchases ~ nbre_purchases, data = by_productid)
## m2: lm(formula = total_purchases ~ nbre_purchases + age, data = by_productid)
## m3: lm(formula = total_purchases ~ nbre_purchases + age + gender,
## data = by_productid)
## m4: lm(formula = total_purchases ~ nbre_purchases + age + gender +
## city_category, data = by_productid)
## m5: lm(formula = total_purchases ~ nbre_purchases + age + gender +
## city_category + marital_status, data = by_productid)
##
## ==============================================================================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) -8839.627*** -5588.955*** -3287.833*** -2850.852*** -2852.977***
## (124.981) (451.301) (464.332) (494.969) (494.647)
## nbre_purchases 11323.322*** 11506.218*** 11579.953*** 11594.525*** 11613.442***
## (15.138) (15.692) (16.074) (16.134) (16.193)
## age: 18-25/0-17 -5187.658*** -5190.914*** -5047.054*** -6205.428***
## (518.004) (517.111) (516.977) (524.744)
## age: 26-35/0-17 -11074.908*** -11382.109*** -11357.097*** -12781.103***
## (506.351) (505.700) (506.047) (518.175)
## age: 36-45/0-17 -3855.750*** -3821.783*** -3742.798*** -5044.605***
## (508.864) (507.990) (507.817) (517.882)
## age: 46-50/0-17 -342.182 7.043 138.222 -1542.343**
## (530.928) (530.287) (529.730) (545.906)
## age: 51-55/0-17 724.897 965.540 1135.432* -500.159
## (541.540) (540.734) (540.236) (555.248)
## age: 55+/0-17 1471.907* 2175.650*** 1852.037** 268.227
## (574.050) (574.091) (573.889) (587.109)
## gender: M/F -4494.113*** -4601.812*** -4662.282***
## (219.508) (219.679) (219.588)
## city_category: B/A -2641.229*** -2678.691***
## (257.176) (257.025)
## city_category: C/A 1522.495*** 1540.340***
## (266.230) (266.060)
## marital_status 2745.671***
## (217.760)
## ----------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.822 0.825 0.825 0.826 0.826
## adj. R-squared 0.822 0.825 0.825 0.826 0.826
## sigma 36640.158 36390.739 36328.010 36285.909 36262.240
## F 559527.698 81270.575 71409.945 57288.886 52163.269
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1443124.344 -1442294.932 -1442085.696 -1441944.403 -1441864.957
## Deviance 162423846693851.906 160212107672853.438 159658924624998.250 159286450786954.969 159077400095792.469
## AIC 2886254.688 2884607.864 2884191.392 2883912.805 2883755.914
## BIC 2886283.798 2884695.195 2884288.427 2884029.247 2883882.059
## N 120988 120988 120988 120988 120988
## ==============================================================================================================================================
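I fit five nested linear models. The largest one (m5) has an R-squared of 0.826, and its overall F-test is significant. The model can be used to predict the total amount spent from the number of purchases and the demographic variables. Together the predictors explain 82.6% of the variance in total_purchases, although the number of purchases alone (m1) already explains 82.2%.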
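The nested-model comparison can be sketched on simulated data as follows. The `toy` data frame and the effect sizes used to generate it are assumptions made for illustration, only loosely shaped like the table above. Only the structure (a base model plus one added predictor) mirrors the analysis.

```r
set.seed(42)  # simulated data; values are illustrative only
n <- 200
toy <- data.frame(
  nbre_purchases = rpois(n, lambda = 4) + 1,
  gender         = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Spend driven mostly by the number of purchases, plus a small gender effect
toy$total_purchases <- 11000 * toy$nbre_purchases -
  4000 * (toy$gender == "M") + rnorm(n, sd = 30000)

# Two nested models: m2 adds gender to m1
m1 <- lm(total_purchases ~ nbre_purchases, data = toy)
m2 <- lm(total_purchases ~ nbre_purchases + gender, data = toy)

# R-squared never decreases when a predictor is added to a nested OLS model
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))

# F-test for whether the added predictor improves the fit
print(anova(m1, m2))
```

Comparing R-squared across nested models shows how much each added variable contributes. A coefficient can be statistically significant, especially with a large N, and still add very little explained variance.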
Single men have the highest amount of purchases. The 26-35 year old customers from all city categories have the highest amount of purchases, and the number of products bought was correlated with the amount spent.
There were no big surprises in this part of the investigation, but I was surprised that the occupation variable didn’t have much influence on the rest of the variables.
The linear model I created shows that age, gender, city_category, and marital status are statistically significant predictors of the amount spent, although they add little beyond the number of purchases (R-squared rises only from 0.822 in m1 to 0.826 in m5). It would have been very nice if I had been able to predict the product_id of the product that a customer would buy using these variables.
The plot shows the purchases made by each customer. The distribution is right-skewed (a long tail toward high amounts), with the bulk of the amounts spent below $2.5 million. It also shows some extremely high amounts spent; these are the outliers of this analysis.
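The per-user summary earlier reports a mean of total_purchases (851,752) well above the median (512,612), which is the usual signature of a long right tail. The sketch below shows the same pattern on simulated log-normal data, not the real totals. The `skewness` helper is a hypothetical moment-based estimator written for this example.

```r
set.seed(1)  # simulated spending totals; not the real data
total_purchases <- rlnorm(1000, meanlog = 13, sdlog = 1)

# Moment-based sample skewness: positive values indicate a long right tail
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew <- skewness(total_purchases)
print(c(mean     = mean(total_purchases),
        median   = median(total_purchases),
        skewness = skew))
```

When the mean sits above the median and the skewness is positive, a few very large values are pulling the average up.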
I like the fact that there is a correlation between the number of purchases and the amount spent. It shows that customers were buying a lot of products and not necessarily very expensive ones. It can be an indication that there are not a lot of outliers.
These two plots show that the purchases were made mainly by 26-35 year old men, and this holds across the three cities. Some occupations were more frequent than others, but occupation doesn’t really have a big influence on the amount of purchases. We can notice that cities A and B are the ones with the most purchases. So these two plots summarize it all and tell us which variables influenced the purchases.
It was very interesting to explore this dataset, and the good part is that it needed no cleaning. I am a big shopper, so I enjoyed working on this particular dataset because I always wonder how big retail stores manage and use their data to explore their customers and their shopping habits. The difficult part was finding the right function in R for a particular small adjustment to a plot. We learn a lot in the classes and absorb a lot of new information, but it is very difficult to remember all of it and to find the right function when needed. I guess a lot of practice is always needed to learn a new language. Overall, I enjoyed working on this project! I wish I had more time to explore this dataset further, and to explore the other side of it, which is finding the most popular products. It could be a good future project for when I finish my Nanodegree.